Hardware Sizing
Hardware Sizing (Standalone Install)
Small Tier - 16 Core, 128G RAM (r5.4xlarge / E16s v3)
Component | RAM | Cores |
---|---|---|
Web | 2g | 2 |
Postgres | 2g | 2 |
Spark | 100g | 10 |
Overhead | 10g | 2 |
Medium Tier - 32 Core, 256G RAM (r5.8xlarge / E32s v3)
Component | RAM | Cores |
---|---|---|
Web | 2g | 2 |
Postgres | 2g | 2 |
Spark | 250g | 26 |
Overhead | 10g | 2 |
Large Tier - 64 Core, 512G RAM (r5.16xlarge / E64s v3)
Component | RAM | Cores |
---|---|---|
Web | 4g | 3 |
Postgres | 4g | 3 |
Spark | 486g | 54 |
Overhead | 18g | 4 |
Important: Collibra DQ limits large tier jobs to 2 TB. If a DQ job exceeds 2 TB, you must filter down its columns or rows.
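One way to read the three tiers is as a lookup: given a job's estimated Spark memory need in gigabytes (however you arrive at that estimate), pick the smallest tier whose Spark allocation covers it. The sketch below is illustrative only. The function name `smallest_tier_for` is hypothetical, and the rule "the estimate must fit within the tier's Spark RAM" is an assumption, not a documented Collibra DQ requirement.

```python
from typing import Optional

# Spark RAM per standalone tier, in GB, copied from the tier tables above.
TIERS = [
    ("Small", 100),
    ("Medium", 250),
    ("Large", 486),
]


def smallest_tier_for(spark_gb: float) -> Optional[str]:
    """Return the smallest tier whose Spark RAM covers spark_gb.

    Assumption: the estimate has to fit within the tier's Spark allocation.
    Returns None when no standalone tier is large enough.
    """
    for name, spark_ram_gb in TIERS:
        if spark_gb <= spark_ram_gb:
            return name
    return None
```

A result of `None` suggests the job is beyond the standalone tiers, so it should either be scoped down (fewer columns or rows) or run on a cluster.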
Estimates
Sizing should allow headroom and be based on peak concurrency and peak volume requirements. If concurrency is not a requirement, you only need to size for peak volume (your largest tables). To scan efficiently, scope the job by selecting only the critical columns. See Scaling your DQ Job for more information.
Bytes per Cell | Rows | Columns | Gigabytes | Gigabytes for Spark (3x) |
---|---|---|---|---|
16 | 1,000,000 | 25 | 0.4 | 1.2 |
16 | 10,000,000 | 25 | 4 | 12 |
16 | 100,000,000 | 25 | 40 | 120 |
16 | 1,000,000 | 50 | 0.8 | 2.4 |
16 | 10,000,000 | 50 | 8 | 24 |
16 | 100,000,000 | 50 | 80 | 240 |
16 | 1,000,000 | 100 | 1.6 | 4.8 |
16 | 10,000,000 | 100 | 16 | 48 |
16 | 100,000,000 | 100 | 160 | 480 |
16 | 1,000,000,000 | 100 | 1600 | 4800 |
16 | 1,000,000 | 200 | 3.2 | 9.6 |
16 | 10,000,000 | 200 | 32 | 96 |
16 | 100,000,000 | 200 | 320 | 960 |
16 | 1,000,000,000 | 200 | 3200 | 9600 |
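The arithmetic behind the table can be reproduced with a short helper, which is handy for row and column counts the table does not list. This is a minimal sketch: the function names are hypothetical, and it assumes decimal gigabytes (1 GB = 10^9 bytes), which is what the table's figures imply.

```python
BYTES_PER_CELL = 16    # per-cell figure used in the table above
SPARK_MULTIPLIER = 3   # the "3x" factor from the table's last column
BYTES_PER_GB = 10**9   # assumption: decimal gigabytes, matching the table


def estimate_gigabytes(rows: int, columns: int,
                       bytes_per_cell: int = BYTES_PER_CELL) -> float:
    """Estimated raw data size in gigabytes."""
    return rows * columns * bytes_per_cell / BYTES_PER_GB


def estimate_spark_gigabytes(rows: int, columns: int,
                             bytes_per_cell: int = BYTES_PER_CELL) -> float:
    """Estimated Spark footprint in gigabytes: raw size times the 3x factor."""
    # Multiply as integers first, then divide once, to limit float rounding.
    return rows * columns * bytes_per_cell * SPARK_MULTIPLIER / BYTES_PER_GB
```

For example, `estimate_spark_gigabytes(100_000_000, 100)` returns `480.0`, matching the 100,000,000-row, 100-column line in the table.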
Cluster
If your program requires more horsepower or more Spark workers than the example tiers above, which is fairly common in Fortune 500 companies, consider the horizontal and ephemeral scale of a cluster. Common examples include Amazon EMR and Cloudera CDP. Collibra DQ is built to scale horizontally and can scale to hundreds of nodes.